How to tokenize documents into sentences


SAS Text Analytics analyzes documents at the document level by default, but sometimes sentence-level analysis yields further insights into the data. Two years ago, the SAS Text Analytics team researched sentence-level text analysis and shared their discoveries in the SGF paper Getting More from the Singular Value Decomposition (SVD): Enhance Your Models with Document, Sentence, and Term Representations. Recently my team started working on a concept extraction project. We need to extract all sentences containing one or two query words, so that linguists don't have to read entire documents in order to write concept extraction rules. This significantly improves their efficiency in developing and tuning rules.

Sentence boundary detection

Sentence boundary detection is a challenge in Natural Language Processing -- it's more complicated than you might expect. For example, most sentences in English end with a period, but sometimes a period denotes an abbreviation or forms part of an ellipsis. My colleagues Biljana and Teresa wrote an article about the complexities of how a period may be used. If you are interested in this topic, please check out their article Text analytics through linguists' eyes: When is a period not a full stop?

Sentence boundary rules differ across languages, and when you work with multilingual data you might want one set of code that handles data in all of those languages. For example, a period in German can mark the end of an ordinal number token; in Chinese, the sentence-final period (。) is different from the English period; and Thai does not use a period to mark the end of a sentence.

Here are several sentence boundary examples:

Sentence | Language | Text
1 | English | Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.
2 | English | I paid $23.45 for this book.
3 | English | We earn more and more money, but we feel less and less happier. So…what happened to us?
4 | Chinese | 北京确实人多车多,但是根源在哪里?
5 | Chinese | 在于首都集中了太多全国性资源。
6 | German | Was sind die Konsequenzen der Abstimmung vom 12. Juni?

How to tokenize documents into sentences with SAS?

There are several ways to build a sentence tokenizer with SAS Text Analytics. Here I list only three methods:

  • Method 1: Use the CAS action applyConcept on SAS Viya
  • Method 2: Use the CAS action tpParse on SAS Viya
  • Method 3: Use SAS DATA step code on SAS 9

Among these three methods, I recommend the first, because it extracts sentences while keeping the raw text intact. With the second method, uppercase letters are changed to lowercase during parsing, and some unseen characters are replaced with white spaces. The third method is based on traditional SAS 9 technology (not SAS Viya), so it might not scale as well to large data.

In this article, I show the SAS code for only the first two methods. For details of the SAS code for the third method, please check out the paper Getting More from the Singular Value Decomposition (SVD): Enhance Your Models with Document, Sentence, and Term Representations.

Use the CAS action applyConcept on SAS Viya

SAS Visual Text Analytics (on SAS Viya) includes a rich set of CAS actions for categorization, concept extraction, and sentiment analysis. The applyConcept action performs concept extraction using a concept extraction model that you have compiled and validated beforehand.

%macro sentenceTokenizer1(
   dsIn=,
   docVar=,
   textVar=,
   language=,
   dsOut=
);
/* Rule for determining sentence boundaries */
data sascas1.concept_rule;
   length rule $ 200;
   ruleId=1;
   rule='ENABLE:SentBoundaries';
   output;
 
   ruleId=2;
   rule='PREDICATE_RULE:SentBoundaries(first,last):(SENT, (SENTSTART_1, "_first{_w}"), (SENTEND_1, "_last{_w}"))';
   output;
run;
 
/* Validate the concept rule syntax */
proc cas;
textRuleDevelop.validateConcept / 
   table={name="concept_rule"}
   config='rule'
   ruleId='ruleId'
   language="&language"
   casOut={name='outValidation',replace=TRUE}
;
run;
quit;
 
/* Compile the concept rule */
proc cas;
textRuleDevelop.compileConcept / 
   table={name="concept_rule"}
   config="rule"
   enablePredefined=false
   language="&language"
   casOut={name="outli", replace=TRUE}
;
run;
quit;
 
/* Get Sentences */
proc cas;
textRuleScore.applyConcept / 
   table={name="&dsIn"}
   docId="&docVar"
   text="&textVar"
   language="&language"
   model={name="outli"}
   matchType="best"
   casOut={name="outpos_eli", replace=TRUE}
   factOut={name="&dsOut", replace=TRUE, where="_fact_argument_=''"}
;
run;
quit;
 
/* Drop the intermediate CAS tables */
proc cas;
   table.dropTable name="concept_rule" quiet=true; run;
   table.dropTable name="outli" quiet=true; run;
   table.dropTable name="outpos_eli" quiet=true; run;
quit; 
%mend sentenceTokenizer1;

Use the CAS action tpParse on SAS Viya

If you don't have SAS Visual Text Analytics, SAS Visual Analytics includes CAS actions for more basic parsing and categorization, including the text parsing action tpParse.

%macro sentenceTokenizer2(
   dsIn=,
   docVar=,
   textVar=,
   language=,
   dsOut=
);
/* Parse the data set; the offset table holds each term with its sentence and position */
proc cas;
textparse.tpParse /
   docId="&docVar"
   documents={name="&dsIn"}
   text="&textVar"
   language="&language"
   cellWeight="NONE"
   stemming=false
   tagging=false
   noungroups=false
   entities="none"
   selectAttribute={opType="IGNORE",tagList={}}
   selectPos={opType="IGNORE",tagList={}}
   offset={name="offset",replace=TRUE}
;
run;
 
/* Get Sentences */
proc cas;
table.partition / 
   table={name="offset" 
          groupby={{name="_document_"}, {name="_sentence_"}}
          orderby={{name="_start_"}}
         }
   casout={name="offset" replace=true};
run;
 
/* Reassemble each sentence's text from its parsed terms */
datastep.runCode /
code= "
data &dsOut;
   set offset;
   by _document_ _sentence_ _start_;
   length _text_ varchar(20000);
   /* Start a new sentence buffer */
   if first._sentence_ then do;
      _text_='';
      _lag_end_ = -1;
   end;
   /* Concatenate adjacent terms directly; otherwise restore the gap with spaces */
   if _start_=_lag_end_+1 then
      _text_=cats(_text_, _term_);
   else
      _text_=trim(_text_)||repeat(' ',_start_-_lag_end_-2)||_term_;
   _lag_end_=_end_;
   if last._sentence_ then output;
   retain _text_ _lag_end_;
   keep _document_ _sentence_ _text_;
run;
";
run;   
quit;
 
/* Drop the intermediate CAS table */
proc cas;
   table.dropTable name="offset" quiet=true; run;
quit; 
%mend sentenceTokenizer2;

Here are three examples, one per language, that run both tokenizer macros:

/*-------------------------------------*/
/* Start CAS Server.                   */
/*-------------------------------------*/
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
 
/*-------------------------------------*/
/* Example 1: Chinese texts            */
/*-------------------------------------*/
data sascas1.text_zh;
   infile cards dlm='|' missover;
   input _document_ text :$200.;
   cards;
1|北京确实人多车多,但是根源在哪里?在于首都集中了太多全国性资源。
;
run;   
 
%sentenceTokenizer1(
   dsIn=text_zh,
   docVar=_document_,
   textVar=text,
   language=chinese,
   dsOut=sentences_zh1
);
 
%sentenceTokenizer2(
   dsIn=text_zh,
   docVar=_document_,
   textVar=text,
   language=chinese,
   dsOut=sentences_zh2
);
 
/*-------------------------------------*/
/* Example 2: English texts            */
/*-------------------------------------*/
data sascas1.text_en;
   infile cards dlm='|' missover;
   input _document_ text :$500.;
   cards;
1|Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.
2|I paid $23.45 for this book.
3|We earn more and more money, but we feel less and less happier. So…what happened to us?
;
run;   
 
%sentenceTokenizer1(
   dsIn=text_en,
   docVar=_document_,
   textVar=text,
   language=english,
   dsOut=sentences_en1
);
 
%sentenceTokenizer2(
   dsIn=text_en,
   docVar=_document_,
   textVar=text,
   language=english,
   dsOut=sentences_en2
);
 
 
/*-------------------------------------*/
/* Example 3: German texts             */
/*-------------------------------------*/
data sascas1.text_de;
   infile cards dlm='|' missover;
   input _document_ text :$600.;
   cards;
1|Was sind die Konsequenzen der Abstimmung vom 12. Juni?
;
run;   
 
%sentenceTokenizer1(
   dsIn=text_de,
   docVar=_document_,
   textVar=text,
   language=german,
   dsOut=sentences_de1
);
 
%sentenceTokenizer2(
   dsIn=text_de,
   docVar=_document_,
   textVar=text,
   language=german,
   dsOut=sentences_de2
);

Table 2 below shows the sentences extracted in the three examples.

Example | Doc | Text | Sentence (Method 1) | Sentence (Method 2)
English | 1 | Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990. | Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990. | rolls-royce motor cars inc. said it expects its u.s. sales to remain steady at about 1,200 cars in 1990.
English | 2 | I paid $23.45 for this book. | I paid $23.45 for this book. | i paid $23.45 for this book.
English | 3 | We earn more and more money, but we feel less and less happier. So…what happened to us? | We earn more and more money, but we feel less and less happier. | we earn more and more money, but we feel less and less happier.
 | | | So…what happened to us? | so…what happened to us?
Chinese | 1 | 北京确实人多车多,但是根源在哪里?在于首都集中了太多全国性资源。 | 北京确实人多车多,但是根源在哪里? | 北京确实人多车多,但是根源在哪里?
 | | | 在于首都集中了太多全国性资源。 | 在于首都集中了太多全国性资源。
German | 1 | Was sind die Konsequenzen der Abstimmung vom 12. Juni? | Was sind die Konsequenzen der Abstimmung vom 12. Juni? | was sind die konsequenzen der abstimmung vom 12. juni?

From the above table, you can see that there is no difference between the two methods for the Chinese text, but many differences for the English and German texts. So which method should you use? It depends on the SAS products you have available. Method 1 depends on the compileConcept, validateConcept, and applyConcept actions and requires SAS Visual Text Analytics. Method 2 depends on the tpParse action in SAS Visual Analytics. If you have both products available, consider your use case. If you are working on text analytics that are case insensitive, such as topic detection or text clustering, you may choose method 2. Otherwise, if the text analytics are case sensitive, such as named entity recognition, you should choose method 1. (And of course, if you don't have SAS Viya, you can use method 3 with SAS 9 and guidance from the cited paper.)

If you have SAS Viya, I suggest trying the above sentence tokenization methods with your data and then running text mining actions on the sentence-level data to see what insights you get.
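As a starting point, here is a minimal sketch of the query-word filtering described in the introduction. It assumes the sentence-level table sentences_en2 produced by %sentenceTokenizer2 above (which has the columns _document_, _sentence_, and _text_); the query word "money" is just an illustration:

/* Keep only sentences that mention a query word */
data sascas1.sentences_query;
   set sascas1.sentences_en2;
   if find(_text_, 'money', 'i') > 0;   /* 'i' = case-insensitive match */
run;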


About Author

Emily Gao

Director Analytics Development Groups, SAS Beijing R&D

Emily Gao manages two product lines: Decision Manager and Text Analytics. She received her master's degree in data mining in 2002 and has been with SAS for 15 years.

3 Comments

  1. Ann,

    Thank you for sharing this more efficient rule for sentence tokenization.
    I modified the original rule to use yours.

    Thanks,
    Emily

  2. When I ran %sentenceTokenizer1 with the sample news table, which contains 600 news articles, the applyConcept action ran indefinitely. This is because this part of the rule does not use the right operators:

    PREDICATE_RULE:SentBoundaries(first,last):(SENT,"_first{_w}","_last{_w}")

    In order to find the first word in a sentence, you must use the SENTSTART_n operator with n=1, and the same goes for the last word in the sentence (using the SENTEND_n operator):

    (SENT, (SENTSTART_1, "_first{_w}"), (SENTEND_1, "_last{_w}"))

    Using the code below, it took less than 2 seconds to run the applyConcept action on Viya 3.4 (with CAS in SMP):

    cas;
    caslib _all_ assign;
    libname mylib "/home/sasdemo/data";
    libname sascas1 cas caslib=casuser;

    data casuser.news;
    set mylib.news;
    _document_ = _n_;
    run;

    %let language=English;
    %let dsIn=news;
    %let docVar=_document_;
    %let textVar=text;
    %let dsOut=sentences1;

    /* Rule for determining sentence boundaries */
    data sascas1.concept_rule;
    length rule $ 200;
    ruleId=1;
    rule='ENABLE:SentBoundaries';
    output;
    ruleId=2;
    /* rule='PREDICATE_RULE:SentBoundaries(first,last):(SENT,"_first{_w}","_last{_w}")'; */
    rule='PREDICATE_RULE:SentBoundaries(first,last):(SENT, (SENTSTART_1, "_first{_w}"), (SENTEND_1, "_last{_w}"))';
    output;
    run;

    proc cas;
    textRuleDevelop.validateConcept /
    table={name="concept_rule"}
    config='rule'
    ruleId='ruleId'
    language="&language"
    casOut={name='outValidation',replace=TRUE}
    ;
    run;
    quit;

    /* Compile concept rule; */
    proc cas;
    textRuleDevelop.compileConcept /
    table={name="concept_rule"}
    config="rule"
    enablePredefined=false
    language="&language"
    casOut={name="outli", replace=TRUE}
    ;
    run;
    quit;

    /* Get Sentences */
    proc cas;
    textRuleScore.applyConcept /
    table={name="&dsIn"}
    docId="&docVar"
    text="&textVar"
    language="&language"
    model={name="outli"}
    matchType="best"
    casOut={name="outpos_eli", replace=TRUE}
    factOut={name="&dsOut", replace=TRUE, where="_fact_argument_=''"}
    ;
    run;
    quit;

    The following is the log from running the applyConcept action:

    134 proc cas;
    135 textRuleScore.applyConcept /
    136 table={name="news_noMiss"}
    137 docId="_document_"
    138 text="text"
    139 language="English"
    140 model={name="outli2"}
    141 matchType="best"
    142 casOut={name="outpos_eli", replace=TRUE}
    143 factOut={name="sentence_out2", replace=TRUE /*, where="_fact_argument_=''" */}
    144 ;
    145 run;
    NOTE: Active Session now CASAUTO.

    146 quit;

    NOTE: PROCEDURE CAS used (Total process time):
    real time 1.60 seconds
    cpu time 0.05 seconds

    In the Results, the rule is validated without errors. Also, the factOut= table (sentences1) contains 6767 rows. Note that the casOut= table (outpos_eli) contains no rows. This is expected: with this rule you should get only a fact table and no concept table.

    Hope this helps!

    Ann

  3. Hi Emily,

    This is a great article and I wish I had read it earlier. I just wanted to add that the _result_id_ in the fact matches orders the sentences in reverse: the last sentence in the document is shown as result id = 1, and the first sentence has the highest result id. But if you order the matches by doc id and by _start_, then it is not a problem. We just have to be careful about how we interpret the _result_id_.
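
    A minimal sketch of that reordering, assuming the method-1 output table sentences_en1 from the examples above has _document_ and _start_ columns:

    /* Put the fact matches into document/start-position order */
    proc sort data=sascas1.sentences_en1 out=work.sentences_ordered;
       by _document_ _start_;
    run;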

    regards
    Sundaresh
